• Wednesday, September 4, 2024

    This paper introduces a stochastic layer-wise shuffle regularization technique to overcome overfitting in Vision Mamba models, enabling them to scale up to 300M parameters while maintaining competitive performance with Vision Transformers (ViTs).
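
    The paper's name suggests the mechanism; below is a minimal PyTorch sketch of one plausible reading, in which each block permutes its input token order during training with a probability that grows with depth. The wrapper class, schedule, and placement are illustrative assumptions, not the authors' implementation.

      import torch
      import torch.nn as nn

      class StochasticLayerShuffle(nn.Module):
          """Wrap a block so that, during training only, its input token order
          is randomly permuted with a probability that grows with layer depth."""
          def __init__(self, block, layer_idx, num_layers, max_prob=0.5):
              super().__init__()
              self.block = block
              # Assumed linear depth schedule; the paper's actual schedule may differ.
              self.p = max_prob * (layer_idx + 1) / num_layers

          def forward(self, x):                     # x: (batch, tokens, dim)
              if self.training and torch.rand(()).item() < self.p:
                  perm = torch.randperm(x.size(1), device=x.device)
                  x = x[:, perm, :]
              return self.block(x)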

  • Wednesday, May 15, 2024

    Researchers investigated the Mamba architecture, which is typically suited to tasks with long-sequence and autoregressive characteristics, and its application to vision. They found that while Mamba offers little benefit for image classification, it shows promise in detection and segmentation tasks, which do exhibit those long-sequence characteristics.

  • Thursday, March 28, 2024

    Databricks and Mosaic have trained a 132B parameter MoE model with impressive performance. They trained the model on 3,000 H100s and have released the weights. The model is also available on the Databricks API.

  • Friday, March 29, 2024

    Mamba is a model architecture designed to beat Transformers in efficiency while matching their performance. Jamba is a novel hybrid variant that interleaves Mamba and Transformer layers and adds MoE layers. It can run at 1,600 tokens per second with a context length of 128k tokens and achieves 67% on the MMLU benchmark. The weights are available.

  • Wednesday, March 20, 2024

    Researchers have developed a new framework to help vision-language models learn continuously without forgetting previous knowledge using a system that expands the model with special adapters for new tasks.

  • Thursday, June 20, 2024

    Microsoft has released an MIT-licensed set of small VLMs that dramatically outperform much larger models on captioning, bounding-box detection, and classification.

  • Monday, March 11, 2024

    The powerful DeepSpeed training library from Microsoft has an update that lets models store weights in 6 bits per parameter. This can speed up inference by well over 2x.
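
    As a rough back-of-the-envelope illustration of why fewer bits per weight can translate into faster, memory-bound inference (numbers are illustrative, not DeepSpeed benchmarks):

      # Approximate weight-memory footprint for a 70B-parameter model.
      params = 70e9
      gb_fp16 = params * 16 / 8 / 1e9    # ~140 GB at 16 bits per parameter
      gb_6bit = params * 6 / 8 / 1e9     # ~52.5 GB at 6 bits per parameter
      print(f"fp16: {gb_fp16:.0f} GB, 6-bit: {gb_6bit:.1f} GB, "
            f"weight bytes reduced {gb_fp16 / gb_6bit:.2f}x")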

  • Wednesday, March 13, 2024

    VideoMamba applies the Mamba state space model to video understanding, efficiently handling local redundancy and long-range global dependencies.

  • Thursday, May 9, 2024

    Microsoft is developing a new AI model named MAI-1, which reportedly boasts about 500 billion parameters, aiming to surpass other major AI models by Google and OpenAI.

  • Monday, April 22, 2024

    50 vision/language datasets combined into a single format to allow for improved training of models.

  • Thursday, September 26, 2024

    Llama 3.2 has been introduced as a significant advancement in edge AI and vision technology, featuring a range of open and customizable models designed for various applications. The release includes small and medium-sized vision large language models (LLMs) with 11 billion and 90 billion parameters, as well as lightweight text-only models with 1 billion and 3 billion parameters. These models are optimized for deployment on edge and mobile devices, making them suitable for tasks such as summarization, instruction following, and rewriting, all while supporting a context length of 128,000 tokens.

    The vision models are designed to excel in image understanding, providing capabilities such as document-level comprehension, image captioning, and visual grounding. They can process both text and image inputs, allowing for complex reasoning over visual data; for instance, users can query the model about sales data represented in graphs or seek navigational assistance based on maps. The lightweight models, on the other hand, focus on multilingual text generation and tool calling, enabling developers to build privacy-focused applications that run entirely on-device.

    Llama 3.2 is supported by a robust ecosystem, with partnerships established with major technology companies like AWS, Databricks, and Qualcomm, ensuring that the models can be easily integrated into various platforms. The release also includes the Llama Stack, a set of tools designed to simplify development across on-premises, cloud, and mobile environments.

    The models have undergone extensive evaluation, demonstrating competitive performance against leading foundation models in both image recognition and language tasks. The vision models incorporate new adapter weights that integrate image processing into the existing language model framework, so the models retain their text-based capabilities while gaining visual reasoning.

    Llama 3.2 also emphasizes responsible AI development. New safety measures, such as Llama Guard, have been introduced to filter inappropriate content and ensure safe interactions, and the lightweight versions have been optimized for efficiency so they can be deployed in constrained environments.

    Overall, Llama 3.2 promotes openness and collaboration within the developer community. The models are available for download and immediate development, encouraging new applications built on generative AI, with a continued commitment to responsible AI practices and engagement with partners and the open-source community.
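
    For a concrete sense of how the vision models are typically invoked, here is a hedged sketch using Hugging Face transformers; the model ID and processor calls follow the publicly documented pattern but should be checked against the current model card, and the image file is hypothetical.

      import torch
      from PIL import Image
      from transformers import AutoProcessor, MllamaForConditionalGeneration

      model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"   # assumed model ID
      model = MllamaForConditionalGeneration.from_pretrained(
          model_id, torch_dtype=torch.bfloat16, device_map="auto")
      processor = AutoProcessor.from_pretrained(model_id)

      image = Image.open("sales_chart.png")                   # hypothetical local image
      messages = [{"role": "user", "content": [
          {"type": "image"},
          {"type": "text", "text": "Which month had the highest sales?"},
      ]}]
      prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
      inputs = processor(image, prompt, add_special_tokens=False,
                         return_tensors="pt").to(model.device)
      output = model.generate(**inputs, max_new_tokens=100)
      print(processor.decode(output[0], skip_special_tokens=True))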

  • Wednesday, March 13, 2024

    A sequence prediction model for DNA built on the Transformer competitor Mamba. It is extremely efficient and powerful for a small model.

  • Wednesday, June 26, 2024

    Microsoft's new many-to-many vision model can be tuned for specific downstream tasks. It isn't quite as powerful as PaliGemma, but is easy to run in PyTorch.

  • Friday, October 4, 2024

    The article discusses a set of tiny test models trained on the ImageNet-1k dataset, created by Ross Wightman and published on Hugging Face. These models represent various popular architecture families and are designed for quick verification of model functionality, allowing users to download pretrained weights and run inference efficiently, even on less powerful hardware. The models are characterized by their smaller size, lower default resolution, and reduced complexity, typically featuring only one block per stage and narrow widths. They were trained using a recent recipe adapted from MobileNet-v4, which is effective for maximizing accuracy in smaller models.

    While the top-1 accuracy scores of these models are not particularly impressive, they may be effective for fine-tuning on smaller datasets and for applications that require reduced computational resources, such as embedded systems or reinforcement learning tasks. The article summarizes the models' performance metrics, including top-1 and top-5 accuracy, parameter counts, and throughput at a resolution of 160x160 pixels. The results indicate that the models, while small, can still achieve reasonable accuracy, with some performing better at a slightly higher resolution of 192x192 pixels.

    The article also reports throughput when the models are compiled with PyTorch 2.4.1 on an RTX 4090 GPU, showing the number of inference and training samples processed per second under different compilation modes, which matters for real-time applications.

    Finally, the article covers the architectural variations of each model. For instance, the ByobNet combines elements from EfficientNet, ResNet, and DarkNet; the ConvNeXt models use depth-wise convolutions and different activation functions; and the EfficientNet variants exercise several normalization techniques, including BatchNorm, GroupNorm, and LayerNorm. The author invites the community to explore applications for these tiny models beyond mere testing, emphasizing their versatility.
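
    A short sketch of pulling one of these test models with timm and running inference; the model name below is an assumption (check the Hugging Face hub for the exact IDs), while the timm calls themselves are standard.

      import timm
      import torch
      from PIL import Image

      # The model name is illustrative -- check the Hugging Face hub for the
      # exact IDs of the "test" models described in the article.
      model = timm.create_model("test_vit.r160_in1k", pretrained=True).eval()

      # Build the preprocessing pipeline from the model's own config (160x160 default).
      cfg = timm.data.resolve_model_data_config(model)
      transform = timm.data.create_transform(**cfg, is_training=False)

      img = Image.open("example.jpg").convert("RGB")   # hypothetical local image
      with torch.no_grad():
          logits = model(transform(img).unsqueeze(0))
      print(logits.softmax(dim=-1).topk(5))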

  • Monday, March 18, 2024

    xAI has released the weights and architecture of its 314 billion parameter Mixture-of-Experts model, Grok-1. It is written in JAX and uses a modern Transformer architecture with GeGLU, RoPE, sandwich norm, and other niceties.
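
    Of the components listed, GeGLU is the simplest to show in a few lines; here is a minimal PyTorch sketch of a GeGLU feed-forward layer (illustrative, not xAI's JAX implementation).

      import torch.nn as nn
      import torch.nn.functional as F

      class GeGLU(nn.Module):
          """Gated GELU feed-forward: half of the projection gates the other half."""
          def __init__(self, dim, hidden):
              super().__init__()
              self.proj_in = nn.Linear(dim, 2 * hidden, bias=False)
              self.proj_out = nn.Linear(hidden, dim, bias=False)

          def forward(self, x):
              value, gate = self.proj_in(x).chunk(2, dim=-1)
              return self.proj_out(value * F.gelu(gate))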

    Hi Impact
  • Wednesday, April 17, 2024

    Vision Language Models (VLMs) often struggle with processing multiple queries per image and identifying when objects are absent. This study introduces a new query format to tackle these issues and incorporates semantic segmentation into the training process.

  • Monday, March 11, 2024

    Last week, a breakthrough was made in training large models on small GPUs. This config shows how to use these technologies to train Mixtral on consumer hardware.
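
    The entry does not reproduce the config itself; as a hedged sketch of the general recipe it points at (4-bit quantization plus LoRA adapters via bitsandbytes and peft, with placeholder hyperparameters):

      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig
      from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

      # Load the base MoE model with 4-bit NF4 quantization to fit consumer VRAM.
      bnb = BitsAndBytesConfig(load_in_4bit=True,
                               bnb_4bit_quant_type="nf4",
                               bnb_4bit_compute_dtype=torch.bfloat16)
      model = AutoModelForCausalLM.from_pretrained(
          "mistralai/Mixtral-8x7B-v0.1", quantization_config=bnb, device_map="auto")

      # Attach small trainable LoRA adapters; hyperparameters are placeholders.
      model = prepare_model_for_kbit_training(model)
      lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
      model = get_peft_model(model, lora)
      model.print_trainable_parameters()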

  • Friday, June 7, 2024

    The Together AI team has a novel VLM that excels at extremely high resolution images due to its efficient architecture.

  • Wednesday, October 2, 2024

    The paper "MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning" introduces a new family of multimodal large language models (MLLMs) aimed at improving text-rich image understanding, visual referring and grounding, and multi-image reasoning. The work builds on the previous MM1 architecture and emphasizes a data-centric approach to model training.

    The authors systematically investigate the effects of diverse data mixtures throughout the training lifecycle, including high-quality Optical Character Recognition (OCR) data and synthetic captions for continual pre-training, and an optimized visual instruction-tuning data mixture for supervised fine-tuning. The models range from 1 billion to 30 billion parameters and include both dense and mixture-of-experts (MoE) variants. The findings suggest that with careful data curation and training strategies, strong performance can be achieved even with the smaller 1B and 3B models.

    The paper also introduces two specialized variants: MM1.5-Video, tailored for video understanding, and MM1.5-UI, designed for mobile user interface understanding. Through extensive empirical studies and ablation experiments, the authors detail the training decisions that shaped the final model designs, offering guidance for future multimodal LLM development and highlighting the importance of data quality and training methodology.

    The paper was submitted on September 30, 2024, and is categorized under Computer Vision and Pattern Recognition, Computation and Language, and Machine Learning. The authors acknowledge support from various institutions and contributors.

  • Friday, March 22, 2024

    Meta Reality Labs has trained a model that takes visual input and translates it into a 3D representation of a scene. The 70M parameter model runs quickly on-device and exhibits extreme stability.

  • Wednesday, June 26, 2024

    Imbue has trained and released an extremely powerful 70B language model. It uses Imbue's custom optimizer and some great data filtering techniques. The model was trained with zero loss spikes.

    Hi Impact
  • Thursday, April 4, 2024

    Researchers have developed DiJiang, a new approach that transforms existing Transformers into leaner, faster models without the heavy burden of retraining.

  • Thursday, April 25, 2024

    Microsoft has released a set of GPU accelerated kernels for training BitNet style models. These models have substantially lower memory cost without much drop in accuracy.

  • Monday, April 15, 2024

    xAI has announced that its latest flagship model has vision capabilities on par with (and in some cases exceeding) state-of-the-art models.

    Hi Impact
  • Thursday, September 12, 2024

    French AI startup Mistral has launched Pixtral 12B, a 12-billion-parameter multimodal model capable of processing both images and text. Available via GitHub and Hugging Face, the model can be fine-tuned and used under an Apache 2.0 license. Its release follows Mistral's $645 million funding round and positions the company as a significant player in Europe's AI landscape.

  • Thursday, September 26, 2024

    Llama 3.2 is the latest iteration of an open-source AI model family designed for versatility and efficiency across applications. The release spans 1B, 3B, 11B, and 90B parameter models, from lightweight mobile applications to more complex multimodal tasks involving both text and images. The 1B and 3B models are optimized for on-device use, suitable for tasks like summarizing discussions or integrating with tools such as calendars, while the 11B and 90B models target more demanding multimodal applications, processing high-resolution images and generating relevant text outputs.

    Llama 3.2 emphasizes a streamlined developer experience through the Llama Stack, which provides a comprehensive toolchain for building applications. Developers can work in Python, Node, Kotlin, or Swift, enabling rapid development and deployment across environments, including on-premises and edge devices. The common API facilitates interoperability, reducing the need for model-level changes and accelerating the integration of new components.

    Performance evaluations have been conducted across over 150 benchmark datasets, demonstrating capabilities in both language understanding and visual reasoning, with competitive results against other leading models in real-world scenarios.

    The Llama ecosystem has seen significant growth, with over 350 million downloads on platforms like Hugging Face, and support from partners such as ARM, MediaTek, and Qualcomm enables deployment of the lightweight models on mobile and edge devices. Companies like Dell are also integrating Llama Stack into their offerings, promoting the adoption of open models in enterprise settings.

    Real-world applications are already being showcased: Zoom has developed an AI companion that produces chat and meeting summaries, DoorDash uses Llama to streamline internal processes, and KPMG has explored secure open-source LLM options for financial institutions. Overall, Llama 3.2 gives developers powerful tools to create efficient, customizable applications while fostering a collaborative community around open-source AI.
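
    As a minimal illustration of calling one of the lightweight text models via Hugging Face transformers (the model ID is assumed from the release naming; real on-device deployments would typically use an optimized mobile runtime instead):

      from transformers import pipeline

      # Assumed model ID for the 3B instruct variant.
      generator = pipeline("text-generation",
                           model="meta-llama/Llama-3.2-3B-Instruct",
                           device_map="auto")

      messages = [{"role": "user",
                   "content": "Summarize: the team met to plan the Q4 launch, "
                              "assigned owners, and set a ship date of Nov 15."}]
      result = generator(messages, max_new_tokens=80)
      # The pipeline returns the conversation with the assistant reply appended.
      print(result[0]["generated_text"][-1]["content"])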

  • Friday, April 19, 2024

    Meta has released 8B and 70B models with dramatically improved performance, particularly in reasoning, context length, and code. It is still training a 400B parameter model, which Meta expects to be competitive with Claude 3 Opus. These models are easily the most powerful open models available.

  • Friday, March 22, 2024

    Sakana AI creates state-of-the-art Japanese language, vision, and image generation models. It introduced an evolutionary model merging method that aims to produce new foundation models without expensive pretraining. The merged models have been released along with an explanation of the method, as sketched below.
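
    As a toy illustration of evolutionary merging in general, the sketch below runs a simple search over an interpolation coefficient between two compatible checkpoints; it is a simplification for intuition, not Sakana's published algorithm, and the fitness function is user-supplied.

      import random

      def merge(state_a, state_b, alpha):
          """Per-parameter linear interpolation between two compatible checkpoints."""
          return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

      def evolve_merge(state_a, state_b, fitness, generations=20, children=8, sigma=0.1):
          """(1+lambda)-style random search over the merge coefficient, keeping the
          candidate that scores highest on the fitness function."""
          best_alpha = 0.5
          best_fit = fitness(merge(state_a, state_b, best_alpha))
          for _ in range(generations):
              for _ in range(children):
                  alpha = min(1.0, max(0.0, best_alpha + random.gauss(0, sigma)))
                  fit = fitness(merge(state_a, state_b, alpha))
                  if fit > best_fit:
                      best_alpha, best_fit = alpha, fit
          return merge(state_a, state_b, best_alpha), best_alpha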

  • Monday, April 22, 2024

    A visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks.
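
    For readers who prefer code to diagrams, here is a compact PyTorch sketch of the ViT front end: patch embedding, a class token, learned position embeddings, and a standard Transformer encoder. Dimensions are illustrative, not tied to the guide.

      import torch
      import torch.nn as nn

      class TinyViT(nn.Module):
          def __init__(self, img=224, patch=16, dim=192, depth=6, heads=3, classes=1000):
              super().__init__()
              n = (img // patch) ** 2
              self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
              self.cls = nn.Parameter(torch.zeros(1, 1, dim))
              self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
              layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, depth)
              self.head = nn.Linear(dim, classes)

          def forward(self, x):                                    # x: (B, 3, H, W)
              x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
              x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
              x = self.encoder(x)
              return self.head(x[:, 0])          # classify from the class token

      logits = TinyViT()(torch.randn(2, 3, 224, 224))   # -> (2, 1000)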

  • Thursday, August 15, 2024

    Nvidia has released its Llama 3.1 Minitron 4B model. By using knowledge distillation and pruning, the model scored 16% better on MMLU than a comparable model trained from scratch while requiring 40x fewer training tokens.
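
    A brief sketch of the logit-distillation component of such a recipe (standard temperature-scaled KL distillation; Nvidia's exact losses and pruning schedule are not reproduced here):

      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, temperature=2.0):
          """KL divergence between temperature-softened teacher and student distributions."""
          s = F.log_softmax(student_logits / temperature, dim=-1)
          t = F.softmax(teacher_logits / temperature, dim=-1)
          # Scale by T^2 so gradients keep a similar magnitude to the hard-label loss.
          return F.kl_div(s, t, reduction="batchmean") * temperature ** 2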